On URL and content persistence

Authors

  • Daniel Gomes
  • Mário J. Silva
Abstract

This report presents a study of URL and content persistence among 51 million pages from a national web harvested 8 times over almost 3 years. This study differs from previous ones because it describes the evolution of a large set of web pages for several years, studying in depth the characteristics of persistent data. We found that the persistence of URLs and contents follows a logarithmic distribution. We characterized persistent URLs and contents, and identified reasons for URL death. We found that lasting contents tend to be referenced by different URLs during their lifetime. On the other hand, half of the contents referenced by persistent URLs did not change.

Similar resources

Uniform resource locator decay in dermatology journals: author attitudes and preservation practices.

OBJECTIVES To describe dermatology journal uniform resource locator (URL) use and persistence and to better understand the level of control and awareness of authors regarding the availability of the URLs they cite. DESIGN Software was written to automatically access URLs in articles published between January 1, 1999, and September 30, 2004, in the 3 dermatology journals with the highest scien...

The Availability and Persistence of Web References in D-Lib Magazine

We explore the availability and persistence of URLs cited in articles published in D-Lib Magazine. We extracted 4387 unique URLs referenced in 453 articles published from July 1995 to August 2004. The availability was checked three times a week for 25 weeks from September 2004 to February 2005. We found that approximately 28% of those URLs failed to resolve initially, and 30% failed to resolve ...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e. whether the URL is malicious or benign. Very little literature has focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...

Analyzing the Persistence of Referenced Web Resources with Memento

In this paper we present the results of a study into the persistence and availability of web resources referenced from papers in scholarly repositories. Two repositories with different characteristics, arXiv and the UNT digital library, are studied to determine if the nature of the repository, or of its content, has a bearing on the availability of the web resources cited by that content. Memen...

Publication date: 2005